Many roads here in Maryland have a reputation for being dangerous. For example, here in College Park, Route 1 has something of a reputation for being dangerous.
For this project I wanted to evaluate which roads in Maryland are actually the most dangerous and see if I could gain some insight into why that might be. I found a dataset of all crashes in Maryland since 2015 on Maryland's open data portal. I downloaded it locally since opendata.maryland.gov goes down for maintenance fairly often, but here is a link to where I got the dataset: https://opendata.maryland.gov/Public-Safety/Maryland-Statewide-Vehicle-Crashes/65du-s3qu.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib as mpl
import statsmodels.api as sm
import statsmodels.formula.api as smf
import sklearn
import folium
from datetime import datetime
crashes = pd.read_csv(r'Maryland_Statewide_Vehicle_Crashes.csv', low_memory=False)  # low_memory=False avoids a mixed-dtype warning on a couple of columns
The dataset was large enough that my computer struggled with it. Since I'm looking to evaluate roads in Maryland and planned on mapping where crashes occur, I kept only crashes with latitude and longitude values. I also reduced the dataset to Q3 of 2022, since it was the most recent complete quarter of data available and I want to find out which road is currently the most dangerous.
crashes = crashes[pd.notnull(crashes['LATITUDE'])]
crashes = crashes[pd.notnull(crashes['LONGITUDE'])]
crashes = crashes[crashes['YEAR'] == 2022]
crashes = crashes[crashes['QUARTER'] == 'Q3']
crashes['datetime'] = crashes['ACC_DATE'].astype(str)+crashes['ACC_TIME'].astype(str)
crashes['datetime'] = crashes['datetime'].apply(lambda x: datetime.strptime(x,"%Y%m%d%H:%M:%S"))
display(crashes)
| | YEAR | QUARTER | LIGHT_DESC | LIGHT_CODE | COUNTY_DESC | COUNTY_NO | MUNI_DESC | MUNI_CODE | JUNCTION_DESC | JUNCTION_CODE | ... | FEET_MILES_FLAG | DISTANCE_DIR_FLAG | REFERENCE_NO | REFERENCE_TYPE_CODE | REFERENCE_SUFFIX | REFERENCE_ROAD_NAME | LATITUDE | LONGITUDE | LOCATION | datetime |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 726311 | 2022 | Q3 | Daylight | 1.0 | Washington | 21.0 | NaN | 0.0 | Not Applicable | 0.00 | ... | F | W | 714.0 | CO | NaN | SUMMERS LA | 39.595377 | -77.681919 | POINT (-77.6819192 39.5953773) | 2022-07-07 11:14:00 |
| 726344 | 2022 | Q3 | Daylight | 1.0 | Wicomico | 22.0 | NaN | 0.0 | Intersection | 2.00 | ... | F | S | 23.0 | CO | NaN | NORRIS TWILLEY RD | 38.515646 | -75.709238 | POINT (-75.709238 38.5156459) | 2022-07-09 17:09:00 |
| 726594 | 2022 | Q3 | Daylight | 1.0 | Wicomico | 22.0 | NaN | 96.0 | Intersection | 2.00 | ... | F | E | 313.0 | MD | NaN | DELMAR RD | 38.462580 | -75.750820 | POINT (-75.75082 38.46258) | 2022-08-03 19:50:00 |
| 726629 | 2022 | Q3 | Daylight | 1.0 | Worcester | 23.0 | NaN | 0.0 | NaN | 9.04 | ... | F | N | 0.0 | UU | NaN | SPUR TO US 50 | 38.327654 | -75.106503 | POINT (-75.106503303235 38.327653546825) | 2022-08-11 13:00:00 |
| 726712 | 2022 | Q3 | Daylight | 1.0 | Baltimore City | 24.0 | NaN | NaN | Intersection | 2.00 | ... | F | E | NaN | NaN | NaN | DRUID HILL AVE | 39.297891 | -76.612438 | POINT (-76.61243800045 39.297890781133) | 2022-09-14 09:07:00 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 848799 | 2022 | Q3 | Daylight | 1.0 | Anne Arundel | 2.0 | NaN | 0.0 | Not Applicable | 0.00 | ... | M | E | 607.0 | CO | NaN | OLD STAGE RD | 39.158164 | -76.628893 | POINT (-76.62889296443 39.158163996637) | 2022-07-16 13:48:00 |
| 848800 | 2022 | Q3 | Daylight | 1.0 | Baltimore | 3.0 | NaN | 0.0 | NaN | 88.00 | ... | F | N | 1300.0 | CO | NaN | REGESTER AVE | 39.378626 | -76.608568 | POINT (-76.608567616667 39.378625966667) | 2022-08-26 18:13:00 |
| 848801 | 2022 | Q3 | Daylight | 1.0 | Baltimore | 3.0 | NaN | 0.0 | Not Applicable | 0.00 | ... | F | W | 4095.0 | CO | NaN | EBENEZER RD | 39.387539 | -76.418305 | POINT (-76.418304879322 39.387539278395) | 2022-07-12 16:53:00 |
| 848802 | 2022 | Q3 | Daylight | 1.0 | Anne Arundel | 2.0 | NaN | 0.0 | Non Intersection | 1.00 | ... | F | N | 12.0 | SR | NaN | OCEANIC DR | 39.019090 | -76.403650 | POINT (-76.40365 39.01909) | 2022-08-11 17:46:00 |
| 848803 | 2022 | Q3 | Dark Lights On | 3.0 | Baltimore | 3.0 | NaN | 0.0 | Non Intersection | 1.00 | ... | F | E | 2189.0 | CO | NaN | PARMELEE RD | 39.367800 | -76.762110 | POINT (-76.76211 39.3678) | 2022-08-02 01:00:00 |
26392 rows × 56 columns
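As a side note, the row-by-row `strptime` apply used above works, but it is slow on large frames; `pd.to_datetime` with an explicit format does the same parse vectorized. A small sketch on sample data shaped like `ACC_DATE`/`ACC_TIME` (not the crash file itself):

```python
import pandas as pd

# Sample frame mimicking the ACC_DATE / ACC_TIME layout in the crash data
df = pd.DataFrame({"ACC_DATE": [20220707, 20220914],
                   "ACC_TIME": ["11:14:00", "09:07:00"]})

# Vectorized parse: concatenate the two columns, then parse with a fixed format
df["datetime"] = pd.to_datetime(df["ACC_DATE"].astype(str) + df["ACC_TIME"],
                                format="%Y%m%d%H:%M:%S")
print(df["datetime"].iloc[0])  # 2022-07-07 11:14:00
```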
I started out with some basic data exploration. I counted the number of crashes on each documented road, then narrowed down to the 100 most dangerous roads in Maryland as defined by number of crashes. By this metric, Route 1 is the third most dangerous road in Maryland in Q3 of 2022, with 517 crashes. This metric is admittedly flawed, since it doesn't take into account road length or capacity or anything of the sort, but for exploration, seeing where the most crashes take place by sheer number is useful.
roadData = crashes['RTE_NO'].value_counts()
roadData = pd.DataFrame({'road':roadData.index,'crashes':roadData.values})
roadData = roadData.head(100)
display(roadData)
display(roadData[roadData['road']==1.0])
| | road | crashes |
|---|---|---|
| 0 | 95.0 | 1129 |
| 1 | 695.0 | 542 |
| 2 | 1.0 | 517 |
| 3 | 50.0 | 449 |
| 4 | 40.0 | 448 |
| ... | ... | ... |
| 95 | 114.0 | 31 |
| 96 | 33.0 | 31 |
| 97 | 564.0 | 30 |
| 98 | 117.0 | 30 |
| 99 | 340.0 | 30 |
100 rows × 2 columns
| | road | crashes |
|---|---|---|
| 2 | 1.0 | 517 |
To figure out whether the number of crashes on any given road is significantly different from the average across the 100 most crashed-on roads, I did a Z-test, since the sample size is large. I found that Route 1 does have significantly more crashes than the other top 100 roads in Maryland, with a p-value of about .0075.
import scipy.stats
roadData['z-score'] = roadData['crashes'].apply(lambda x:(roadData['crashes'].mean()-x)/roadData['crashes'].std())  # sign is flipped relative to the usual (x - mean)/std; the two-sided p-value below takes abs(), so the result is unaffected
roadData['p-value'] = roadData['z-score'].apply(lambda x: scipy.stats.norm.sf(abs(x))*2)
roadData['Significant'] = roadData['p-value'].apply(lambda x: x<.05)
display(roadData.head(10))
| | road | crashes | z-score | p-value | Significant |
|---|---|---|---|---|---|
| 0 | 95.0 | 1129 | -6.766122 | 1.322797e-11 | True |
| 1 | 695.0 | 542 | -2.839874 | 4.513129e-03 | True |
| 2 | 1.0 | 517 | -2.672658 | 7.525298e-03 | True |
| 3 | 50.0 | 449 | -2.217828 | 2.656654e-02 | True |
| 4 | 40.0 | 448 | -2.211140 | 2.702616e-02 | True |
| 5 | 70.0 | 367 | -1.669358 | 9.504652e-02 | False |
| 6 | 301.0 | 326 | -1.395122 | 1.629790e-01 | False |
| 7 | 270.0 | 323 | -1.375056 | 1.691140e-01 | False |
| 8 | 5.0 | 312 | -1.301481 | 1.930939e-01 | False |
| 9 | 495.0 | 310 | -1.288104 | 1.977099e-01 | False |
Now that I have the 100 most dangerous roads in Maryland by sheer number of crashes, I want to go back to the original dataset to figure out whether I can learn why these roads are so dangerous. To do this, I removed all entries that didn't take place on one of those 100 roads, so that I could look at their crashes in the context of the full dataset.
roads = roadData['road'].tolist()
crashes = crashes[crashes['RTE_NO'].isin(roads)]
I mapped all of the crashes from the ten most crashed-on roads in Maryland to see if this could provide any insight about where the most crashes occur. After limiting the crash data to those ten roads, I mapped the outlier roads (those with significantly more crashes) in a dark blue and the rest in a lighter blue.
map_osm = folium.Map(location=[39.0458, -76.6413], zoom_start=8)
roads = roadData['road'].head(10)
top10roads = crashes[crashes['RTE_NO'].isin(roads)]
for index, crash in top10roads.iterrows():
    # Outlier roads (significantly more crashes than average) in dark blue, the rest in light blue
    if crash['RTE_NO'] in (95, 695, 1, 50, 40):
        folium.Circle(location=[crash["LATITUDE"], crash["LONGITUDE"]], fill=True, opacity=0.8,
                      color='#0E345B').add_to(map_osm)
    else:
        folium.Circle(location=[crash["LATITUDE"], crash["LONGITUDE"]], fill=True, opacity=0.8,
                      color='#5D8EC1').add_to(map_osm)
map_osm
We can see some interesting things. For one, some roads only have crashes intermittently, such as Route 40, which only has high crash numbers near population centers like Hagerstown. Most roads, however, have relatively uniform numbers of crashes along their length, such as 95 and 270.
We can see that our outlier roads are those going through Baltimore and DC, such as the Baltimore Beltway, Route 1, and 95.
Interestingly, Route 50 is also an outlier, even though for most of its length it runs over the bay or on the peninsula rather than near DC or Baltimore like the rest of the outliers. We can see a high density of crashes on the Bay Bridge, which is decidedly not a population center. It is, however, something of a choke point: anyone who wants to cross the bay must pass over it, so its traffic volume is very high, leading to more crashes.
Now that we've done this, we should get back to the fact that not all of these roads are the same length. If we get the lengths of these roads, we can divide the number of crashes by length to get crashes per mile, essentially standardizing the crash counts and allowing us to compare these roads on equal footing. To do this, I looked up road lengths manually and entered each one, since I couldn't find a good dataset of road lengths in Maryland that could be easily joined on.
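For reference, if a joinable lengths table had existed, attaching it would be a one-line merge. The sketch below builds a tiny lengths frame by hand (using three of the manually entered lengths) just to show the shape of the join:

```python
import pandas as pd

# Tiny hand-built lengths table standing in for the dataset I couldn't find
lengths = pd.DataFrame({"road": [95.0, 695.0, 1.0],
                        "length": [109.01, 51.46, 80.25]})

counts = pd.DataFrame({"road": [95.0, 695.0, 1.0],
                       "crashes": [1129, 542, 517]})

# Left join on route number, then normalize crash counts by road length
merged = counts.merge(lengths, on="road", how="left")
merged["crashes per mile"] = merged["crashes"] / merged["length"]
print(merged.round(2))
```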
# Road lengths in miles, entered by hand (no joinable lengths dataset was available);
# roads outside this list keep a length of 0
road_lengths = {
    95: 109.01, 695: 51.46, 1: 80.25, 50: 150.06, 40: 221.31,
    70: 93.62, 301: 122.85, 270: 34.70, 5: 74.34, 495: 23.02,
    97: 55.27, 2: 79.24, 140: 38.50, 355: 36.75, 450: 30.19,
    193: 26.07, 26: 44.1, 29: 29.51, 13: 42.48, 4: 64.85,
    295: 32.52, 650: 25.89, 3: 9.56, 45: 30.06, 214: 24.97,
    528: 9.04, 210: 20.86, 202: 13.92, 212: 10.43, 410: 13.92,
    32: 51.79, 7: 22.83, 198: 14.14, 100: 22.05, 15: 37.85,
    28: 37.38, 595: 19.97, 150: 13.01, 68: 18.5, 83: 34.50,
}
roadData['length'] = roadData['road'].map(road_lengths).fillna(0)
def crashpermilecalculator(crashes, miles):
if miles != 0:
return crashes / miles
else:
return np.nan
roadData['crashes per mile'] = roadData.apply(lambda x: crashpermilecalculator(x.crashes, x.length), axis=1)
roadData.sort_values(by='crashes per mile', ascending=False, inplace=True)
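An equivalent vectorized form of the helper above, shown on a toy frame: replacing zero lengths with NaN lets the division produce NaN for roads without a known length, with no explicit branch.

```python
import numpy as np
import pandas as pd

# Toy frame: one road with a known length, one without (length 0)
df = pd.DataFrame({"crashes": [310, 120], "length": [23.02, 0.0]})

# Zero lengths become NaN, so the division yields NaN instead of dividing by zero
df["crashes per mile"] = df["crashes"] / df["length"].replace(0, np.nan)
print(df["crashes per mile"].round(2).tolist())  # [13.47, nan]
```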
Now that we have crashes per mile, we can check whether any of these roads has a statistically significant difference in crashes per mile. I again used a Z-test, this time over the 40 roads with known lengths, and chose a significance level of .05.
cpm_std = roadData.head(40)['crashes per mile'].std()
cpm_mean = roadData.head(40)['crashes per mile'].mean()
roadData.index = range(0,100)
def crashpermileZcalculator(cpm):
return (cpm-cpm_mean)/cpm_std
roadData['crashes per mile z'] = roadData['crashes per mile'].apply(crashpermileZcalculator)
roadData['crashes per mile p'] = roadData['crashes per mile z'].apply(lambda x: scipy.stats.norm.sf(abs(x))*2)
roadData['cpm Significant'] = roadData['crashes per mile p'].apply(lambda x: x<.05)
print("Mean crashes per mile: ",cpm_mean)
print("Standard deviation crashes per mile: ",cpm_std)
display(roadData.head(10))
Mean crashes per mile:  5.736719044120439
Standard deviation crashes per mile:  3.0091851914620156
| | road | crashes | z-score | p-value | Significant | length | crashes per mile | crashes per mile z | crashes per mile p | cpm Significant |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 495.0 | 310 | -1.288104 | 1.977099e-01 | False | 23.02 | 13.466551 | 2.568746 | 0.010207 | True |
| 1 | 3.0 | 120 | -0.017257 | 9.862318e-01 | False | 9.56 | 12.552301 | 2.264926 | 0.023517 | True |
| 2 | 528.0 | 110 | 0.049630 | 9.604173e-01 | False | 9.04 | 12.168142 | 2.137264 | 0.032577 | True |
| 3 | 695.0 | 542 | -2.839874 | 4.513129e-03 | True | 51.46 | 10.532452 | 1.593698 | 0.111004 | False |
| 4 | 95.0 | 1129 | -6.766122 | 1.322797e-11 | True | 109.01 | 10.356848 | 1.535342 | 0.124700 | False |
| 5 | 212.0 | 104 | 0.089762 | 9.284764e-01 | False | 10.43 | 9.971237 | 1.407197 | 0.159369 | False |
| 6 | 270.0 | 323 | -1.375056 | 1.691140e-01 | False | 34.70 | 9.308357 | 1.186912 | 0.235262 | False |
| 7 | 202.0 | 104 | 0.089762 | 9.284764e-01 | False | 13.92 | 7.471264 | 0.576417 | 0.564333 | False |
| 8 | 410.0 | 103 | 0.096451 | 9.231627e-01 | False | 13.92 | 7.399425 | 0.552544 | 0.580576 | False |
| 9 | 193.0 | 192 | -0.498841 | 6.178915e-01 | False | 26.07 | 7.364787 | 0.541033 | 0.588485 | False |
After doing this, we can see that by crashes per mile, Route 1 is only the fourteenth most dangerous road in Maryland. To find out which of these values are statistically significant, we calculated z-scores and p-values for crashes per mile, again using a Z-test since we have 40 samples.
We now know that among the 40 roads with known lengths (drawn from the 100 most crashed-on roads in Maryland), the average number of crashes per mile is about 5.7, with a standard deviation of about 3.
From the p-values, we can see that we have three outliers: 495, 3, and 528, each with over 12 crashes per mile. Earlier we saw that, in sheer number of crashes, only 495 even ranked in the top 10, but once we control for length, each of the three has significantly more crashes per mile than the average.
We can now map our top 10 most dangerous roads by crashes per mile, once again using a darker blue for outliers and lighter for the rest.
map_osm = folium.Map(location=[39.0458, -76.6413], zoom_start=8)
roads = roadData['road'].head(10)
top10roadscpm = crashes[crashes['RTE_NO'].isin(roads)]
for index, crash in top10roadscpm.iterrows():
    # Outlier roads by crashes per mile in dark blue, the rest in light blue
    if crash['RTE_NO'] in (495, 3, 528):
        folium.Circle(location=[crash["LATITUDE"], crash["LONGITUDE"]], fill=True, opacity=0.8,
                      color='#0E345B').add_to(map_osm)
    else:
        folium.Circle(location=[crash["LATITUDE"], crash["LONGITUDE"]], fill=True, opacity=0.8,
                      color='#5D8EC1').add_to(map_osm)
map_osm
This yields further insights. We can see that the most dangerous roads tend to be smaller roads near population centers, with more crashes where roads pass near or through major cities like Washington or Baltimore. For the most part, the roads with high absolute numbers of crashes still have high crashes per mile, although the order of the rankings changes.
Interestingly, the most dangerous roads still tend to be nearer to population centers, which was also true for the most dangerous roads by sheer number of crashes.
The outliers are interesting.
495 sits right outside DC in a very busy area; combined with the fact that it has no length outside that populated area to offset its crash count, this gives it a very high number of crashes per mile, which makes sense.
Route 3 has two areas dense with crashes: near Glen Burnie and near Crofton. It is a small, heavily traveled road, which leads to the high crashes per mile.
528 being here is strange, so I looked into it a little more below.
528 is the most interesting road here. It's nowhere near Baltimore or DC, where we've seen the most crashes in raw numbers. Instead, it's near Ocean City. Ocean City sees a lot of seasonal travel, and I'd exclusively looked at Q3 of 2022 to keep the data manageable, which made me wonder whether I'd accidentally captured a seasonal trend. I extracted the data on 528 and plotted crashes by quarter, which showed that quarters 2 and 3 (April through September) have far more crashes. This confirmed that I had accidentally captured a seasonal trend.
rawcrashes = pd.read_csv(r'Maryland_Statewide_Vehicle_Crashes.csv', low_memory=False)  # low_memory=False avoids the mixed-dtype warning
crashes528 = rawcrashes[pd.notnull(rawcrashes['LATITUDE'])]
crashes528 = crashes528[pd.notnull(crashes528['LONGITUDE'])]
crashes528 = crashes528[crashes528['RTE_NO']==528]
ax = crashes528['QUARTER'].value_counts().plot.bar(color = "#88cdce")
ax.set_title("Seasonal crashes on 528")
The seasonal variation on 528 raised some questions about the other outliers, so I did the same thing to them to figure out if they have seasonal variation like 528 does.
fig, (ax1,ax2) = plt.subplots(1,2)
crashes495 = rawcrashes[pd.notnull(rawcrashes['LATITUDE'])]
crashes495 = crashes495[pd.notnull(crashes495['LONGITUDE'])]
crashes495 = crashes495[crashes495['RTE_NO']==495]
crashes3 = rawcrashes[pd.notnull(rawcrashes['LATITUDE'])]
crashes3 = crashes3[pd.notnull(crashes3['LONGITUDE'])]
crashes3 = crashes3[crashes3['RTE_NO']==3]
seasonalCrashes495 = crashes495['QUARTER'].value_counts()
seasonalCrashes3 = crashes3['QUARTER'].value_counts()
ax1.bar(seasonalCrashes495.index,seasonalCrashes495.values,color = "#88cdce")
ax2.bar(crashes3['QUARTER'].value_counts().index,crashes3['QUARTER'].value_counts().values,color = "#88cdce")
ax1.set_title("Seasonal crashes on 495")
ax2.set_title("Seasonal crashes on 3")
We can see that the other outlier roads aren't much affected by season. Route 3 and 495 both show slight variations in crashes per quarter, but nowhere near the swing that 528 shows.
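The visual comparison could be firmed up with a chi-square goodness-of-fit test against a uniform split of crashes across quarters. A sketch on placeholder quarterly counts (hypothetical numbers, since the actual per-quarter values aren't printed above):

```python
from scipy import stats

# Hypothetical quarterly crash counts for a strongly seasonal road (Q1..Q4)
quarterly = [40, 90, 110, 35]

# Chi-square goodness-of-fit: by default, expected counts are uniform across cells
chi2, p = stats.chisquare(quarterly)
print(f"chi2={chi2:.1f}, p={p:.2g}")  # a small p means crashes are not uniform across quarters
```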
From this we can draw a few conclusions.
Qualities of dangerous roads: the most dangerous roads in Maryland are those closest to population centers. Especially dangerous are small, low-capacity roads like Route 3 or 528.
How does Route 1 fare? When put into context, Route 1 does not have significantly more crashes per mile than one would expect. The best predictor of which roads will have the most crashes is how close they run to population centers, and for how much of their length. The areas near Baltimore and DC are generally the most dangerous in Maryland.
Seasonally affected roads: the number of crashes on 528 varies strongly with the seasons due to its proximity to Ocean City.
Most dangerous road: in terms of crashes per mile, 495 was the most dangerous road in Maryland in Q3 of 2022.